1 Background

The outbreak of the novel coronavirus disease 2019 (COVID-19) was declared a public health emergency of international concern by the World Health Organization (WHO) on January 30, 2020. Upwards of 112 million cases have been confirmed worldwide, with nearly 2.5 million associated deaths. Within the US alone, there have been over 500,000 deaths and upwards of 28 million cases reported. Governments around the world have implemented and suggested a number of policies to slow the spread of the pandemic, including mask-wearing requirements, travel restrictions, business and school closures, and even stay-at-home orders. The global pandemic has impacted the lives of individuals in countless ways, and though many countries have begun vaccinating individuals, the long-term impact of the virus remains unclear.

The impact of COVID-19 on a given segment of the population appears to vary drastically based on the socioeconomic characteristics of that segment. In particular, differing rates of infection and fatalities have been reported among different racial, age, and socioeconomic groups. One of the most important metrics for determining the impact of the pandemic is the death rate: the proportion of the total population that dies due to the disease.

We assembled this dataset for our research with the goal of investigating the effectiveness of lockdowns in flattening the COVID curve. We provide a portion of the cleaned dataset for this case study.

There are two main goals for this case study.

  1. We show the dynamic evolution of COVID cases and COVID-related deaths at the state level.
  2. We investigate which county-level demographic factors and policy interventions are associated with COVID-19 mortality rates in the US, constructing models to identify possible factors related to county-level mortality rates.

Remark: please keep up with the most updated version of this write-up.

2 Data Summary

The data comes from several different sources:

  1. County-level infection and fatality data - This dataset gives daily cumulative numbers on infection and fatality for each county.
  2. County-level socioeconomic data - The following are the four relevant datasets from this site.
    1. Income - Poverty level and household income.
    2. Jobs - Employment type, rate, and change.
    3. People - Population size, density, education level, race, age, household size, and migration rates.
    4. County Classifications - Type of county (rural or urban on a rural-urban continuum scale).
  3. Intervention Policy Data - This dataset is a manually compiled list of the dates that interventions/lockdown policies were implemented and lifted at the county level.

3 EDA

In this case study, we use the following three cleaned datasets:

  • covid_county.csv: County-level socioeconomic information combining the four datasets mentioned above: Income (poverty level and household income), Jobs (employment type, rate, and change), People (population size, density, education level, race, age, household size, and migration rates), and County Classifications
  • covid_rates.csv: Daily cumulative numbers on infection and fatality for each county
  • covid_intervention.csv: County-level lockdown interventions.

Across all three datasets, the unique county identifier is FIPS.
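Because FIPS is the shared key, the three files can be merged county by county. A minimal sketch with toy rows (the FIPS codes and values below are hypothetical; the real files are read in with fread below):

```r
# Toy stand-ins: county attributes and daily case records share the FIPS key
county_toy <- data.frame(FIPS = c(36061, 53033),
                         PovertyAllAgesPct = c(14.8, 7.9))
rates_toy <- data.frame(FIPS = c(36061, 36061, 53033),
                        date = as.Date(c("2020-04-01", "2020-04-02", "2020-04-01")),
                        cum_cases = c(100, 150, 20))

# Left join: every daily record keeps its county's attributes
merged <- merge(rates_toy, county_toy, by = "FIPS", all.x = TRUE)
```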

The cleaning procedure is attached in Appendix 2: Data cleaning. You may go through it if you are interested or would like to make any changes.

First read in the data.

# county-level socioeconomic information
county_data <- fread("data/covid_county.csv") 
# county-level COVID case and death
covid_rate <- fread("data/covid_rates.csv")
# county-level lockdown dates 
covid_intervention <- fread("data/covid_intervention.csv")

3.1 Understand the data

The detailed description of variables is in Appendix 1: Data description. Please get familiar with the variables. Summarize the datasets briefly.

head(covid_intervention)
head(county_data)
head(covid_rate)
summary(covid_rate)
The three datasets are county_data, covid_intervention and covid_rate. The first two share the property that each observation is a US county. county_data is an extremely wide dataset: it has 205 variables describing each county's socioeconomic demographics. Most variables are numeric, but a few categorical variables describe the county's type (e.g. Type_2015_Recreation_NO - whether the county mainly relies on the recreation industry; RuralUrbanContinuumCode2013 - where the county lands on the rural-urban continuum). covid_intervention describes each county's measures against COVID; its variables are the dates on which the county stopped and resumed "gatherings", "public schools", "travel ban", "entertainment", etc.
The third dataset, covid_rate, is extremely long, because each observation is one day: there are cumulative case and death totals for all counties in all US states from January 2020 to February 2021 (for some counties, data are only available for part of that period). There is also an estimate of each county's total population in 2019.

3.2 COVID case trend

It is crucial to choose the right granularity for visualization and analysis. We will compare daily vs. weekly total new cases by state and see that daily reports are hard to interpret.

  i. Plot new COVID cases in NY, WA and FL by state and by day. Any irregular pattern? What is the biggest problem with using single-day data?
covid_rate_sub <- covid_rate %>% filter(State=="New York" | State == "Washington" | State == "Florida") 
day_state <- covid_rate_sub %>% group_by(State,date) %>% summarize(cum_cases_state = sum(cum_cases))
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.
As shown in the R code above, we first created a subset of covid_rate containing only NY, WA and FL. To get a new column of new cases by day by state, we first need to sum the cumulative cases over all counties in a state to get the state's cumulative cases. We did this by grouping covid_rate by state and date and then summing.
diff_day <- rep(0, 1112) # 1112 is the number of rows of day_state
daily_increment <- cbind(day_state, diff_day)
## New names:
## * NA -> ...4
split_by_state <- split(daily_increment, daily_increment$State)
split_by_state <- lapply(split_by_state, function(x) transform(x, diff_day= c(NA, diff(cum_cases_state))))
daily_increment <- do.call(rbind, split_by_state)
Then, to get the difference in cum_cases (the cumulative case total) between consecutive days, we first created a column diff_day, filled it with zeros, and appended it to the long dataset daily_increment. We then split the long dataset by State into three shorter datasets, one per state. For each state, we set the first entry of diff_day to NA, since we have no data for the date prior to the first day listed, and filled the following entries with the difference between cum_cases in each row and the previous row using the diff() function. An important assumption is that, within each state, the data are already sorted by date; conveniently, this assumption is satisfied.
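The same split/apply/combine differencing can be written more compactly with dplyr's grouped lag(); a minimal sketch on hypothetical numbers, making the sorted-by-date assumption explicit:

```r
library(dplyr)

# Hypothetical cumulative state totals over three days
day_state_toy <- data.frame(
  State = rep(c("Florida", "New York"), each = 3),
  date  = rep(as.Date("2020-03-01") + 0:2, times = 2),
  cum_cases_state = c(5, 5, 12, 10, 25, 60))

daily_toy <- day_state_toy %>%
  group_by(State) %>%
  arrange(date, .by_group = TRUE) %>%   # make the date ordering explicit
  mutate(diff_day = cum_cases_state - lag(cum_cases_state)) %>%  # NA on each state's first day
  ungroup()
```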
daily_increment_plot <- daily_increment %>%
  ggplot(aes(x = date, y = diff_day, col = State)) + 
  geom_line() + 
  geom_point() +
  theme_bw() +
ggtitle("Daily New COVID cases in NY, FL and WA")
ggplotly(daily_increment_plot +
           theme(legend.position = "none"))
The biggest difficulty for interpretation in the daily new-cases plot is that the lines are not "smooth": there is too much zigzagging. The visualization presents noise that distracts from the trend. For example, there is a notable up-down swing for Florida between January 1, 2021 and January 2, 2021: the plot shows 0 new cases on January 1 and over 30,000 new cases on January 2. This is probably because new cases were not recorded on January 1, 2021, but the big jump is disturbing nonetheless.
A separate problem is that the current analysis does not take the population differences among the three states into account at all, so we cannot tell how much of Washington's smaller peak is due to its smaller base population.
  ii. Create weekly new cases per 100k, weekly_case_per100k. Plot the spaghetti plots of weekly_case_per100k by state. Use TotalPopEst2019 as the population.
week_state <- covid_rate %>% group_by(State,week)%>% mutate(cum_cases_week_state = sum(cum_cases))%>% ungroup() %>%
  group_by(State) %>% mutate(statePopEst2019 = sum(TotalPopEst2019))%>% ungroup() %>% 
  select(State,week, cum_cases_week_state, statePopEst2019) %>% distinct(State, week, .keep_all = TRUE)
diff_week <- rep(0, 2518) # 2518 is the number of rows of week_state
weekly_increment <- cbind(week_state, diff_week)
split_by_state <- split(weekly_increment, weekly_increment$State)
split_by_state <- lapply(split_by_state, function(x) transform(x, diff_week= c(NA, diff(cum_cases_week_state))))
weekly_increment <- do.call(rbind, split_by_state)
weekly_increment <- weekly_increment %>% mutate(weekly_new_per100k = diff_week * 100000 /statePopEst2019 )
To create the variable weekly_new_per100k, we first compute the cumulative total for every week for every state. We also sum the 2019 populations of all counties in a state to get the state's population. Once we have weekly cumulative totals and population by state, we repeat the differencing process used for daily new cases above to get weekly new cases for every state. The variable weekly_new_per100k for each state is then weekly new cases divided by the state's total population, multiplied by 100,000.
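The differencing and per-100k scaling can be checked on hypothetical numbers:

```r
# Hypothetical weekly cumulative totals for one state
cum_cases_week_state <- c(1000, 3000, 6500)
statePopEst2019 <- 7e6                    # hypothetical state population

diff_week <- c(NA, diff(cum_cases_week_state))        # weekly new cases
weekly_new_per100k <- diff_week * 100000 / statePopEst2019
```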
weekly_increment_plot <- weekly_increment %>%
  ggplot(aes(x = week, y = weekly_new_per100k, group = State, col = State)) + 
  geom_line() + 
  geom_point() +
  theme_bw()+
  ggtitle("Weekly New COVID Cases per 100k")+
  theme(legend.position = "none")
weekly_increment_plot

We noticed that during earlier weeks there appeared to be negative weekly new case counts, probably due to misreporting. To examine the trend more clearly, we exclude all data points showing a negative weekly increase.
weekly_increment_excl <- weekly_increment %>% filter(weekly_new_per100k >= 0)
weekly_increment_plot_excl <- weekly_increment_excl %>%
  ggplot(aes(x = week, y = weekly_new_per100k, col = State)) + 
  geom_line() + 
  geom_point() +
  theme_bw()+
  ggtitle("Weekly New COVID Cases per 100k")
ggplotly(weekly_increment_plot_excl +
           theme(legend.position = "none"))
  iii. Summarize the COVID case trend among states based on the plot in ii). What could be possible reasons for the variability?
summary(covid_intervention)
##       FIPS          STATE            AREA_NAME          stay at home       
##  Min.   : 1000   Length:3272        Length:3272        Min.   :2020-03-18  
##  1st Qu.:19026   Class :character   Class :character   1st Qu.:2020-03-25  
##  Median :30022   Mode  :character   Mode  :character   Median :2020-03-29  
##  Mean   :31368                                         Mean   :2020-03-29  
##  3rd Qu.:46101                                         3rd Qu.:2020-04-03  
##  Max.   :72153                                         Max.   :2020-04-08  
##                                                        NA's   :576         
##  >50 gatherings       >500 gatherings      public schools      
##  Min.   :2020-03-16   Min.   :2020-03-12   Min.   :2020-03-16  
##  1st Qu.:2020-03-19   1st Qu.:2020-03-16   1st Qu.:2020-03-17  
##  Median :2020-03-22   Median :2020-03-19   Median :2020-03-18  
##  Mean   :2020-03-22   Mean   :2020-03-20   Mean   :2020-03-19  
##  3rd Qu.:2020-03-26   3rd Qu.:2020-03-26   3rd Qu.:2020-03-20  
##  Max.   :2020-04-03   Max.   :2020-04-03   Max.   :2020-04-03  
##  NA's   :200          NA's   :200                              
##  restaurant dine-in   entertainment/gym    federal guidelines  
##  Min.   :2020-03-13   Min.   :2020-03-13   Min.   :2020-03-17  
##  1st Qu.:2020-03-18   1st Qu.:2020-03-18   1st Qu.:2020-03-17  
##  Median :2020-03-20   Median :2020-03-20   Median :2020-03-17  
##  Mean   :2020-03-20   Mean   :2020-03-21   Mean   :2020-03-17  
##  3rd Qu.:2020-03-22   3rd Qu.:2020-03-24   3rd Qu.:2020-03-17  
##  Max.   :2020-04-05   Max.   :2020-04-07   Max.   :2020-03-17  
##                       NA's   :67                               
##  foreign travel ban   stay at home rollback >50 gatherings rollback
##  Min.   :2020-03-12   Min.   :2020-04-25    Min.   :2020-05-02     
##  1st Qu.:2020-03-12   1st Qu.:2020-05-05    1st Qu.:2020-05-16     
##  Median :2020-03-12   Median :2020-05-16    Median :2020-05-31     
##  Mean   :2020-03-12   Mean   :2020-05-17    Mean   :2020-05-29     
##  3rd Qu.:2020-03-12   3rd Qu.:2020-06-01    3rd Qu.:2020-06-13     
##  Max.   :2020-03-12   Max.   :2020-06-23    Max.   :2020-07-09     
##                       NA's   :604           NA's   :2020           
##  >500 gatherings rollback restaurant dine-in rollback
##  Min.   :2020-05-02       Min.   :2020-04-25         
##  1st Qu.:2020-05-17       1st Qu.:2020-05-05         
##  Median :2020-06-02       Median :2020-05-12         
##  Mean   :2020-05-31       Mean   :2020-05-15         
##  3rd Qu.:2020-06-13       3rd Qu.:2020-05-19         
##  Max.   :2020-07-09       Max.   :2020-07-09         
##  NA's   :2230             NA's   :424                
##  entertainment/gym rollback
##  Min.   :2020-04-25        
##  1st Qu.:2020-05-05        
##  Median :2020-05-12        
##  Mean   :2020-05-16        
##  3rd Qu.:2020-05-22        
##  Max.   :2020-07-09        
##  NA's   :564
The first pattern that attracts our attention is the variability of weekly new cases across time: in some periods the number of new cases grows week over week, whereas in others it falls. This temporal pattern is shared by every state: there are apparently three peaks, starting around week 12 (April 5-11, 2020), week 25 (July 5-11, 2020) and week 43 (November 8-14, 2020). A peak is where the weekly increase culminates and starts to decrease, so to find possible explanations for the pattern we mainly ask: 1) what happened 2-3 weeks before each peak to accelerate the growth of new cases? 2) what happened right around the peak to make the curve start to wind down?
The first peak might be the natural culmination following the first detection of the pandemic, before intervention. Summarizing the first variables in covid_intervention, we notice that the mean dates for the start of lockdown all fall in the latter half of March. This probably explains why the first peak is in early April: as travel and interaction were largely restricted after the first wave of lockdowns, weekly new cases began to decrease.
However, weekly new cases began to rise again after hitting bottom in early June (week 21). Many factors may have contributed to this rise: after George Floyd's death on May 26, 2020, there was a surge of demonstrations and protests related to Black Lives Matter in late May and early June in the US. As shown in the following image from the report "Demonstrations & Political Violence in America: New Data for Summer 2020" by The Armed Conflict Location & Event Data Project (ACLED), in the week of June 7, corresponding to our week 21, Black Lives Matter demonstrations hit a peak of more than 3,000 events in that week. On top of this series of heated civil movements, the summary of covid_intervention shows that most states issued "lockdown rollback" policies in mid-to-late May, which also contributed to the rise in weekly new cases two weeks later.

“Demonstrations & Political Violence in America: New Data for Summer 2020, The Armed Conflict Location & Event Data Project (ACLED).”

Finally, the second peak passed and hit bottom in week 30, presumably as a consequence of another round of lockdown policies as well as a subsiding number of Black Lives Matter demonstrations. However, cases quickly rose again toward the third and biggest peak. We do not have lockdown data that could shed light on the development of the third peak, but we hypothesize that the large number of political gatherings leading up to the presidential election contributed to the sharp growth. The fact that the third wave peaked in election week supports this hypothesis: as the results settled, cases began to wind down. We attribute the subsequent smaller peak on the way down, in week 51, to winter holiday travel.
Beyond the three-peak temporal pattern shared by all states, there is variability among the states as well. For some states, the ups and downs in the curve are on a really small scale, and weekly new cases have generally been low (e.g. Vermont, Maine, Washington). Our hypothesis is that their smaller base populations, even though we are looking at new cases per 100k people, nevertheless contributed to the slower growth in cases. On closer inspection, the leading states for the three peaks differ. New York is among the highest during the first peak. This makes sense: if the first peak is the culmination of spread through contact and travel without interventions, where else would experience more contact and travel than New York? Leading the second peak are Arizona, Florida and other, mostly Southern, states. A closer look at the graph reveals that for these states it is their first peak rather than their second, most likely a combination of a delayed first culmination (they are not travel hubs) and lax lockdown policies: both Arizona and Florida issued their "stay at home" orders in April, later than the mean. South Dakota, North Dakota and Iowa lead the third peak. Searching the covid_intervention dataset, we find that South Dakota and North Dakota both have NAs for the fields "stay at home", ">50 gatherings", ">500 gatherings" and "entertainment", and Iowa has NAs for "stay at home". If NA means these states never issued the corresponding lockdown orders, one hypothesis is that the lack of strict lockdown policies explains why they lead the third peak without being prominent in the previous two waves.
  iv. (Optional) Use covid_intervention to examine whether lockdowns were effective in flattening the curve.

3.3 COVID death trend

  i. For each month, plot the monthly deaths per 100k as a heatmap by state on a US map. Use the same color range across months. (Hint: set the limits argument in scale_fill_gradient.)

  ii. Use plotly to animate the monthly maps in i). Does it reveal any systematic way to capture the dynamic changes among states?
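A sketch of how the fixed color range and the animation might fit together. The data frame monthly_state below, with columns State, month and monthly_death_per100k, is a hypothetical stand-in for the real monthly summary; the maps, ggplot2 and plotly packages are assumed installed:

```r
library(ggplot2)
library(plotly)

# Hypothetical monthly summary; in the real analysis this comes from covid_rate.
# State names are lowercase to match map_data("state") region names.
monthly_state <- data.frame(
  State = rep(c("new york", "florida"), each = 2),
  month = rep(c("2020-04", "2020-05"), times = 2),
  monthly_death_per100k = c(80, 40, 5, 10))

states_map <- map_data("state")   # state polygons from the maps package

p <- ggplot(monthly_state,
            aes(map_id = State, fill = monthly_death_per100k, frame = month)) +
  geom_map(map = states_map) +
  expand_limits(x = states_map$long, y = states_map$lat) +
  # Fixed limits keep the color scale comparable across months
  scale_fill_gradient(low = "white", high = "red",
                      limits = c(0, max(monthly_state$monthly_death_per100k))) +
  ggtitle("Monthly COVID deaths per 100k by state")

# ggplotly(p)   # the `frame` aesthetic would animate over months
```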

4 COVID factor

We now try to build a good parsimonious model to find possible factors related to the death rate at the county level. Let us not take the time series into account for the moment and use the totals as of Feb 1, 2021.

  i. Create the response variable total_death_per100k, the total number of COVID deaths per 100k by Feb 1, 2021. We suggest taking the log transformation log_total_death_per100k = log(total_death_per100k + 1). Merge total_death_per100k into county_data for the following analysis.
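The construction of the response can be sketched on toy numbers (the column name cum_deaths is an assumption about how the cleaned file names its cumulative death count):

```r
# Toy stand-in for covid_rate filtered to Feb 1, 2021
feb1 <- data.frame(FIPS = c(1001, 1003),
                   cum_deaths = c(50, 0),
                   TotalPopEst2019 = c(100000, 50000))

feb1$total_death_per100k     <- feb1$cum_deaths * 100000 / feb1$TotalPopEst2019
feb1$log_total_death_per100k <- log(feb1$total_death_per100k + 1)

# In the real analysis, merge into county_data by FIPS, e.g.:
# county_data <- merge(county_data, feb1[, c("FIPS", "log_total_death_per100k")], by = "FIPS")
```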

  ii. Select possible variables in county_data as covariates. We provide county_data_sub, a subset of variables from county_data, to get you started. Please add any potential variables as you wish.

    1. Report missing values in your final subset of variables.

    2. In the following analysis, you may ignore the missing values.

county_data_sub <- county_data %>%
  select(County, State, FIPS, Deep_Pov_All, PovertyAllAgesPct, PerCapitaInc, UnempRate2019, PctEmpFIRE, PctEmpConstruction, PctEmpTrans, PctEmpMining, PctEmpTrade, PctEmpInformation, PctEmpAgriculture, PctEmpManufacturing, PctEmpServices, PopDensity2010, OwnHomePct, Age65AndOlderPct2010, TotalPop25Plus, Under18Pct2010, Ed2HSDiplomaOnlyPct, Ed3SomeCollegePct, Ed4AssocDegreePct, Ed5CollegePlusPct, ForeignBornPct, Net_International_Migration_Rate_2010_2019, NetMigrationRate1019, NaturalChangeRate1019, TotalPopEst2019, WhiteNonHispanicPct2010, NativeAmericanNonHispanicPct2010, BlackNonHispanicPct2010, AsianNonHispanicPct2010, HispanicPct2010, Type_2015_Update, RuralUrbanContinuumCode2013, UrbanInfluenceCode2013, Perpov_1980_0711, HiCreativeClass2000, HiAmenity, Retirement_Destination_2015_Update)
  iii. Use LASSO to choose a parsimonious model with all available sensible county-level information. Force in State in the process. Why do we need to force in State?
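One way to force State into a LASSO fit is glmnet's penalty.factor argument: a zero penalty factor leaves the corresponding coefficients unpenalized, so the State dummies are never dropped. A sketch on simulated data (the covariate names are placeholders, not the real county variables):

```r
library(glmnet)

set.seed(1)
dat <- data.frame(State = factor(rep(c("FL", "NY", "WA"), each = 20)),
                  x1 = rnorm(60), x2 = rnorm(60),
                  log_total_death_per100k = rnorm(60))

X <- model.matrix(log_total_death_per100k ~ ., data = dat)[, -1]
y <- dat$log_total_death_per100k
is_state <- grepl("^State", colnames(X))

# penalty.factor = 0 for the State dummy columns forces them into the model
fit <- cv.glmnet(X, y, penalty.factor = ifelse(is_state, 0, 1))
coef(fit, s = "lambda.1se")
```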

  iv. Use Cp or BIC to fine-tune the LASSO model from iii). Again, force in State in the process.
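One common fine-tuning step is to refit unpenalized lm models on nested candidate variable sets, always keeping State, and compare their BIC; a sketch on simulated data (placeholder covariates):

```r
set.seed(1)
dat <- data.frame(State = factor(rep(c("FL", "NY", "WA"), each = 20)),
                  x1 = rnorm(60), x2 = rnorm(60), y = rnorm(60))

# Candidate models, each forcing in State
m1 <- lm(y ~ State + x1,      data = dat)
m2 <- lm(y ~ State + x1 + x2, data = dat)

c(BIC(m1), BIC(m2))   # prefer the model with the smaller BIC
```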

  v. If necessary, reduce the model from iv) to a final model with all variables significant at the 0.05 level. Are the linear model assumptions all reasonably met?

  vi. It has been shown that COVID affects the elderly the most. It is also claimed that the COVID death rate among African Americans and Latinxs is higher. Does your analysis support these claims?

  vii. Based on your final model, summarize your findings. In particular, summarize the state effect controlling for the other variables. Provide intervention recommendations to policy makers to reduce the COVID death rate.

  viii. What else can we do to improve our model? What other important information might we have missed?

Appendix 1: Data description

A detailed summary of the variables in each data set follows:

Infection and fatality data

  • date: Date
  • county: County name
  • state: State name
  • fips: County code that uniquely identifies a county
  • cases: Number of cumulative COVID-19 infections
  • deaths: Number of cumulative COVID-19 deaths

Socioeconomic demographics

Income: Poverty level and household income

  • PovertyUnder18Pct: Poverty rate for children age 0-17, 2018
  • Deep_Pov_All: Deep poverty, 2014-18
  • Deep_Pov_Children: Deep poverty for children, 2014-18
  • PovertyAllAgesPct: Poverty rate, 2018
  • MedHHInc: Median household income, 2018 (In 2018 dollars)
  • PerCapitaInc: Per capita income in the past 12 months (In 2018 inflation adjusted dollars), 2014-18
  • PovertyAllAgesNum: Number of people of all ages in poverty, 2018
  • PovertyUnder18Num: Number of people age 0-17 in poverty, 2018

Jobs: Employment type, rate, and change

  • UnempRate2007-2019: Unemployment rate, 2007-2019

  • NumEmployed2007-2019: Employed, 2007-2019

  • NumUnemployed2007-2019: Unemployed, 2007-2019

  • PctEmpChange1019: Percent employment change, 2010-19

  • PctEmpChange1819: Percent employment change, 2018-19

  • PctEmpChange0719: Percent employment change, 2007-19

  • PctEmpChange0710: Percent employment change, 2007-10

  • NumCivEmployed: Civilian employed population 16 years and over, 2014-18

  • NumCivLaborforce2007-2019: Civilian labor force, 2007-2019

  • PctEmpFIRE: Percent of the civilian labor force 16 and over employed in finance and insurance, and real estate and rental and leasing, 2014-18

  • PctEmpConstruction: Percent of the civilian labor force 16 and over employed in construction, 2014-18

  • PctEmpTrans: Percent of the civilian labor force 16 and over employed in transportation, warehousing and utilities, 2014-18

  • PctEmpMining: Percent of the civilian labor force 16 and over employed in mining, quarrying, oil and gas extraction, 2014-18

  • PctEmpTrade: Percent of the civilian labor force 16 and over employed in wholesale and retail trade, 2014-18

  • PctEmpInformation: Percent of the civilian labor force 16 and over employed in information services, 2014-18

  • PctEmpAgriculture: Percent of the civilian labor force 16 and over employed in agriculture, forestry, fishing, and hunting, 2014-18

  • PctEmpManufacturing: Percent of the civilian labor force 16 and over employed in manufacturing, 2014-18

  • PctEmpServices: Percent of the civilian labor force 16 and over employed in services, 2014-18

  • PctEmpGovt: Percent of the civilian labor force 16 and over employed in public administration, 2014-18

People: Population size, density, education level, race, age, household size, and migration rates

  • PopDensity2010: Population density, 2010

  • LandAreaSQMiles2010: Land area in square miles, 2010

  • TotalHH: Total number of households, 2014-18

  • TotalOccHU: Total number of occupied housing units, 2014-18

  • AvgHHSize: Average household size, 2014-18

  • OwnHomeNum: Number of owner occupied housing units, 2014-18

  • OwnHomePct: Percent of owner occupied housing units, 2014-18

  • NonEnglishHHPct: Percent of non-English speaking households of total households, 2014-18

  • HH65PlusAlonePct: Percent of persons 65 or older living alone, 2014-18

  • FemaleHHPct: Percent of female headed family households of total households, 2014-18

  • FemaleHHNum: Number of female headed family households, 2014-18

  • NonEnglishHHNum: Number of non-English speaking households, 2014-18

  • HH65PlusAloneNum: Number of persons 65 years or older living alone, 2014-18

  • Age65AndOlderPct2010: Percent of population 65 or older, 2010

  • Age65AndOlderNum2010: Population 65 years or older, 2010

  • TotalPop25Plus: Total population 25 and older, 2014-18 - 5-year average

  • Under18Pct2010: Percent of population under age 18, 2010

  • Under18Num2010: Population under age 18, 2010

  • Ed1LessThanHSPct: Percent of persons with no high school diploma or GED, adults 25 and over, 2014-18

  • Ed2HSDiplomaOnlyPct: Percent of persons with a high school diploma or GED only, adults 25 and over, 2014-18

  • Ed3SomeCollegePct: Percent of persons with some college experience, adults 25 and over, 2014-18

  • Ed4AssocDegreePct: Percent of persons with an associate’s degree, adults 25 and over, 2014-18

  • Ed5CollegePlusPct: Percent of persons with a 4-year college degree or more, adults 25 and over, 2014-18

  • Ed1LessThanHSNum: No high school, adults 25 and over, 2014-18

  • Ed2HSDiplomaOnlyNum: High school only, adults 25 and over, 2014-18

  • Ed3SomeCollegeNum: Some college experience, adults 25 and over, 2014-18

  • Ed4AssocDegreeNum: Number of persons with an associate’s degree, adults 25 and over, 2014-18

  • Ed5CollegePlusNum: College degree 4-years or more, adults 25 and over, 2014-18

  • ForeignBornPct: Percent of total population foreign born, 2014-18

  • ForeignBornEuropePct: Percent of persons born in Europe, 2014-18

  • ForeignBornMexPct: Percent of persons born in Mexico, 2014-18

  • ForeignBornCentralSouthAmPct: Percent of persons born in Central or South America, 2014-18

  • ForeignBornAsiaPct: Percent of persons born in Asia, 2014-18

  • ForeignBornCaribPct: Percent of persons born in the Caribbean, 2014-18

  • ForeignBornAfricaPct: Percent of persons born in Africa, 2014-18

  • ForeignBornNum: Number of people foreign born, 2014-18

  • ForeignBornCentralSouthAmNum: Number of persons born in Central or South America, 2014-18

  • ForeignBornEuropeNum: Number of persons born in Europe, 2014-18

  • ForeignBornMexNum: Number of persons born in Mexico, 2014-18

  • ForeignBornAfricaNum: Number of persons born in Africa, 2014-18

  • ForeignBornAsiaNum: Number of persons born in Asia, 2014-18

  • ForeignBornCaribNum: Number of persons born in the Caribbean, 2014-18

  • Net_International_Migration_Rate_2010_2019: Net international migration rate, 2010-19

  • Net_International_Migration_2010_2019: Net international migration, 2010-19

  • Net_International_Migration_2000_2010: Net international migration, 2000-10

  • Immigration_Rate_2000_2010: Net international migration rate, 2000-10

  • NetMigrationRate0010: Net migration rate, 2000-10

  • NetMigrationRate1019: Net migration rate, 2010-19

  • NetMigrationNum0010: Net migration, 2000-10

  • NetMigration1019: Net Migration, 2010-19

  • NaturalChangeRate1019: Natural population change rate, 2010-19

  • NaturalChangeRate0010: Natural population change rate, 2000-10

  • NaturalChangeNum0010: Natural change, 2000-10

  • NaturalChange1019: Natural population change, 2010-19

  • TotalPop2010: Population size 4/1/2010 Census

  • TotalPopEst2010: Population size 7/1/2010

  • TotalPopEst2011: Population size 7/1/2011

  • TotalPopEst2012: Population size 7/1/2012

  • TotalPopEst2013: Population size 7/1/2013

  • TotalPopEst2014: Population size 7/1/2014

  • TotalPopEst2015: Population size 7/1/2015

  • TotalPopEst2016: Population size 7/1/2016

  • TotalPopEst2017: Population size 7/1/2017

  • TotalPopEst2018: Population size 7/1/2018

  • TotalPopEst2019: Population size 7/1/2019

  • TotalPopACS: Total population, 2014-18 - 5-year average

  • TotalPopEstBase2010: County Population estimate base 4/1/2010

  • NonHispanicAsianPopChangeRate0010: Population change rate Non-Hispanic Asian, 2000-10

  • PopChangeRate1819: Population change rate, 2018-19

  • PopChangeRate1019: Population change rate, 2010-19

  • PopChangeRate0010: Population change rate, 2000-10

  • NonHispanicNativeAmericanPopChangeRate0010: Population change rate Non-Hispanic Native American, 2000-10

  • HispanicPopChangeRate0010: Population change rate Hispanic, 2000-10

  • MultipleRacePopChangeRate0010: Population change rate multiple race, 2000-10

  • NonHispanicWhitePopChangeRate0010: Population change rate Non-Hispanic White, 2000-10

  • NonHispanicBlackPopChangeRate0010: Population change rate Non-Hispanic African American, 2000-10

  • MultipleRacePct2010: Percent multiple race, 2010

  • WhiteNonHispanicPct2010: Percent Non-Hispanic White, 2010

  • NativeAmericanNonHispanicPct2010: Percent Non-Hispanic Native American, 2010

  • BlackNonHispanicPct2010: Percent Non-Hispanic African American, 2010

  • AsianNonHispanicPct2010: Percent Non-Hispanic Asian, 2010

  • HispanicPct2010: Percent Hispanic, 2010

  • MultipleRaceNum2010: Population size multiple race, 2010

  • WhiteNonHispanicNum2010: Population size Non-Hispanic White, 2010

  • BlackNonHispanicNum2010: Population size Non-Hispanic African American, 2010

  • NativeAmericanNonHispanicNum2010: Population size Non-Hispanic Native American, 2010

  • AsianNonHispanicNum2010: Population size Non-Hispanic Asian, 2010

  • HispanicNum2010: Population size Hispanic, 2010

County classifications: Type of county (rural or urban on a rural-urban continuum scale)

  • Type_2015_Recreation_NO: Recreation counties, 2015 edition

  • Type_2015_Farming_NO: Farming-dependent counties, 2015 edition

  • Type_2015_Mining_NO: Mining-dependent counties, 2015 edition

  • Type_2015_Government_NO: Federal/State government-dependent counties, 2015 edition

  • Type_2015_Update: County typology economic types, 2015 edition

  • Type_2015_Manufacturing_NO: Manufacturing-dependent counties, 2015 edition

  • Type_2015_Nonspecialized_NO: Nonspecialized counties, 2015 edition

  • RecreationDependent2000: Nonmetro recreation-dependent, 1997-00

  • ManufacturingDependent2000: Manufacturing-dependent, 1998-00

  • FarmDependent2003: Farm-dependent, 1998-00

  • EconomicDependence2000: Economic dependence, 1998-00

  • RuralUrbanContinuumCode2003: Rural-urban continuum code, 2003

  • UrbanInfluenceCode2003: Urban influence code, 2003

  • RuralUrbanContinuumCode2013: Rural-urban continuum code, 2013

  • UrbanInfluenceCode2013: Urban influence code, 2013

  • Noncore2013: Nonmetro noncore, outside Micropolitan and Metropolitan, 2013

  • Micropolitan2013: Micropolitan, 2013

  • Nonmetro2013: Nonmetro, 2013

  • Metro2013: Metro, 2013

  • Metro_Adjacent2013: Nonmetro, adjacent to metro area, 2013

  • Noncore2003: Nonmetro noncore, outside Micropolitan and Metropolitan, 2003

  • Micropolitan2003: Micropolitan, 2003

  • Metro2003: Metro, 2003

  • Nonmetro2003: Nonmetro, 2003

  • NonmetroNotAdj2003: Nonmetro, nonadjacent to metro area, 2003

  • NonmetroAdj2003: Nonmetro, adjacent to metro area, 2003

  • Oil_Gas_Change: Change in the value of onshore oil and natural gas production, 2000-11

  • Gas_Change: Change in the value of onshore natural gas production, 2000-11

  • Oil_Change: Change in the value of onshore oil production, 2000-11

  • Hipov: High poverty counties, 2014-18

  • Perpov_1980_0711: Persistent poverty counties, 2015 edition

  • PersistentChildPoverty_1980_2011: Persistent child poverty counties, 2015 edition

  • PersistentChildPoverty2004: Persistent child poverty counties, 2004

  • PersistentPoverty2000: Persistent poverty counties, 2004

  • Low_Education_2015_update: Low education counties, 2015 edition

  • LowEducation2000: Low education, 2000

  • HiCreativeClass2000: Creative class, 2000

  • HiAmenity: High natural amenities

  • RetirementDestination2000: Retirement destination, 1990-00

  • Low_Employment_2015_update: Low employment counties, 2015 edition

  • Population_loss_2015_update: Population loss counties, 2015 edition

  • Retirement_Destination_2015_Update: Retirement destination counties, 2015 edition

Appendix 2: Data cleaning

The raw data sets are messy and need transforming before we can begin our EDA. Cleaning and merging the different data sources takes time and effort, so we provide the final output of the cleaned and merged data. The cleaning procedure is documented below; please read through it to understand what is in the cleaned data. We set eval = data_cleaned in the following cleaning chunks so that they only run if any of data/covid_county.csv, data/covid_rates.csv, or data/covid_intervention.csv does not exist.

# Indicator to check whether the data files exist
data_cleaned <- !(file.exists("data/covid_county.csv") & 
                    file.exists("data/covid_rates.csv") & 
                    file.exists("data/covid_intervention.csv"))
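In the R Markdown source, each cleaning chunk carries this flag as its eval option. A sketch of how such a chunk header would look (shown as comments to avoid nesting fences; the chunk label clean-nyc is illustrative):

```r
# A cleaning chunk in the .Rmd source would be declared as, e.g.:
#   ```{r clean-nyc, eval = data_cleaned}
# so its body runs only when data_cleaned is TRUE, i.e. when at
# least one of the three output files is missing.
```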

We first read in the tables using data.table::fread(), as we did last time.

# COVID case/mortality rate data
covid_rates <- fread("data/us_counties.csv", na.strings = c("NA", "", "."))
nyc <- fread("data/nycdata.csv", na.strings = c("NA", "", "."))

# Socioeconomic data
income <- fread("data/income.csv", na.strings = c("NA", "", "."))
jobs <- fread("data/jobs.csv", na.strings = c("NA", "", "."))
people <- fread("data/people.csv", na.strings = c("NA", "", "."))
county_class <- fread("data/county_classifications.csv", na.strings = c("NA", "", "."))

# Intervention policy data
int_dates <- fread("data/intervention_dates.csv", na.strings = c("NA", "", "."))

4.1 Clean NYC data

The original NYC data contains more information than we need. We extract only the numbers of cases and deaths and format them to match the covid_rates data.

# NYC county fips matching table 
nyc_fips <- data.table(FIPS = c('36005', '36047', '36061', '36081', '36085'),
                       County = c("BX", "BK", "MN", "QN", "SI"))

# nyc case
nyc_case <- nyc[,.(date = as.Date(date_of_interest, "%m/%d/%Y"),
                   BX = BX_CASE_COUNT, 
                   BK = BK_CASE_COUNT, 
                   MN = MN_CASE_COUNT, 
                   QN = QN_CASE_COUNT, 
                   SI = SI_CASE_COUNT)]

nyc_case %<>% 
  pivot_longer(cols = BX:SI, 
               names_to = "County",
               values_to = "cases") %>%
  arrange(date) %>%
  group_by(County) %>%
  mutate(cum_cases = cumsum(cases))

# nyc death
nyc_death <- nyc[,.(date = as.Date(date_of_interest, "%m/%d/%Y"),
                   BX = BX_DEATH_COUNT, 
                   BK = BK_DEATH_COUNT, 
                   MN = MN_DEATH_COUNT, 
                   QN = QN_DEATH_COUNT, 
                   SI = SI_DEATH_COUNT)]

nyc_death %<>% 
  pivot_longer(cols = BX:SI, 
               names_to = "County",
               values_to = "deaths") %>%
  arrange(date) %>%
  group_by(County) %>%
  mutate(cum_deaths = cumsum(deaths))

nyc_rates <- merge(nyc_case, 
                   nyc_death,
                   by = c("date", "County"),
                   all.x = TRUE)

nyc_rates <- merge(nyc_rates,
                   nyc_fips,
                   by = "County")

nyc_rates$State <- "NY"
nyc_rates %<>% 
  select(date, FIPS, County, State, cum_cases, cum_deaths) %>%
  arrange(FIPS, date)

4.2 Continental US cases

We only consider cases and deaths in the continental US. Alaska, Hawaii, and Puerto Rico have 02, 15, and 72 as the first two digits of their respective FIPS codes. We use the %/% operator (integer division) to extract the first two digits of each FIPS code. All NYC counties are aggregated as County == "New York City" in covid_rates with no FIPS, so we drop those rows and combine the NYC data prepared above into covid_rates.

covid_rates <- covid_rates %>% 
  arrange(fips, date) %>%
  filter(!(fips %/% 1000 %in% c(2, 15, 72))) %>%
  filter(county != "New York City") %>%
  rename(FIPS = "fips",
         County = "county",
         State = "state",
         cum_cases = "cases", 
         cum_deaths = "deaths")


covid_rates$date <- as.Date(covid_rates$date)

covid_rates <- rbind(covid_rates, 
                     nyc_rates)
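As a quick sanity check of the integer-division trick (the FIPS values below are illustrative, not taken from our data):

```r
# The first two digits of a 5-digit FIPS code identify the state
36005 %/% 1000  # Bronx County, NY (36005) -> 36, kept
2013  %/% 1000  # an Alaska FIPS (02013)   -> 2,  filtered out
15001 %/% 1000  # Hawaii County, HI        -> 15, filtered out
```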

4.3 COVID date to week

We set the week of Jan 21, 2020 (the date of the first confirmed COVID-19 case in the US) as the first week (2020-01-19 to 2020-01-25).

covid_rates[, week := (interval("2020-01-19", date) %/% weeks(1)) + 1]
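To convince ourselves the formula behaves as intended, a minimal check (assuming lubridate is loaded; week_of is a throwaway helper, not part of the cleaning pipeline):

```r
library(lubridate)
# Week index: number of whole weeks since 2020-01-19, plus one
week_of <- function(d) (interval("2020-01-19", as.Date(d)) %/% weeks(1)) + 1
week_of("2020-01-21")  # first US case        -> 1
week_of("2020-01-25")  # last day of week 1   -> 1
week_of("2020-01-26")  # first day of week 2  -> 2
```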

4.4 COVID infection/mortality rates

Merge the TotalPopEst2019 variable from the demographic data with covid_rates by FIPS and calculate the cumulative infection/mortality rates for each county (the cumulative cases/deaths divided by the county population). FIPS codes for Kansas City, Missouri, Rhode Island, and some other areas are missing. We drop them for the moment and output the data up to week 57 as covid_rates.csv.

covid_rates <- merge(covid_rates[!is.na(FIPS)], 
                     people[, .(FIPS = as.character(FIPS),
                                TotalPopEst2019)],
                     by = "FIPS",  
                     all.x = TRUE)

covid_rates %<>% 
  mutate(cum_infection_rate = cum_cases/TotalPopEst2019,
         cum_mortality_rate = cum_deaths/TotalPopEst2019)

fwrite(covid_rates %>% 
         filter(week < 58) %>% 
         arrange(FIPS, date), 
       "data/covid_rates.csv")

4.5 Formatting date in int_dates

We convert the columns representing dates in int_dates to R Date objects using as.Date(). Since as.Date() interprets numeric input as a count of days since an origin, we need to specify origin = "0001-01-01". We output the data as covid_intervention.csv.

# Drop the first row, which does not contain intervention data
int_dates <- int_dates[-1, ]
# All columns after the first three hold dates stored as day counts
date_cols <- names(int_dates)[-(1:3)]
int_dates[, (date_cols) := lapply(.SD, as.Date, origin = "0001-01-01"), 
          .SDcols = date_cols]

fwrite(int_dates, "data/covid_intervention.csv")
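A minimal illustration of the origin parameter: as.Date() treats a numeric value as a number of days to add to the origin date.

```r
# Numeric values are interpreted as day counts relative to the origin
as.Date(0,  origin = "0001-01-01")  # "0001-01-01"
as.Date(30, origin = "0001-01-01")  # "0001-01-31"
```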

4.6 NA in COVID data

NA values in the cumulative case/death columns of covid_rates correspond to counties with no confirmed cases/deaths yet. We replace these NA values with zeros.

covid_rates$cum_cases[is.na(covid_rates$cum_cases)] <- 0
covid_rates$cum_deaths[is.na(covid_rates$cum_deaths)] <- 0

4.7 Merge demographic data

Merge the demographic data sets (income, people, jobs, and county_class) by FIPS and output the result as covid_county.csv.

countydata <-
  merge(x = income,
        y = merge(
          x = people,
          y = jobs,
          by = c("FIPS", "State", "County")),
        by = c("FIPS", "State", "County"),
        all = TRUE)


countydata <- 
  merge(
    x = countydata,
    y = county_class %>% rename(FIPS = FIPStxt), 
    by = c("FIPS", "State", "County"),
    all = TRUE
  )

# Check dimensions
# They are now 3279 x 208
dim(countydata)
fwrite(countydata, "data/covid_county.csv")
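As a reminder of why we pass all = TRUE: merge() then performs a full (outer) join, keeping counties that appear in only one table and filling the columns they lack with NA. A toy example (the FIPS values and column names are illustrative):

```r
x <- data.frame(FIPS = c("01001", "01003"), pop  = 1:2)
y <- data.frame(FIPS = c("01003", "01005"), jobs = 3:4)
# Full outer join: all three FIPS codes survive the merge
merge(x, y, by = "FIPS", all = TRUE)
#    FIPS pop jobs
# 1 01001   1   NA
# 2 01003   2    3
# 3 01005  NA    4
```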